{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.011108,
     "end_time": "2021-04-25T19:21:30.744526",
     "exception": false,
     "start_time": "2021-04-25T19:21:30.733418",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "\n",
    "\n",
    "This notebook contains an example for teaching.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_execution_state": "idle",
    "_uuid": "051d70d956493feee0c6d64651c6a088724dca2a",
    "papermill": {
     "duration": 0.009798,
     "end_time": "2021-04-25T19:21:30.764874",
     "exception": false,
     "start_time": "2021-04-25T19:21:30.755076",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "# A Simple Case Study using Wage Data from 2015 - proceeding"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.009841,
     "end_time": "2021-04-25T19:21:30.784643",
     "exception": false,
     "start_time": "2021-04-25T19:21:30.774802",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "So far we considered many machine learning method, e.g Lasso and Random Forests, to build a predictive model. In this lab, we extend our toolbox by predicting wages by a neural network."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.009823,
     "end_time": "2021-04-25T19:21:30.804349",
     "exception": false,
     "start_time": "2021-04-25T19:21:30.794526",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "## Data preparation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.009956,
     "end_time": "2021-04-25T19:21:30.824442",
     "exception": false,
     "start_time": "2021-04-25T19:21:30.814486",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": 0.223372,
     "end_time": "2021-04-25T19:21:31.057720",
     "exception": false,
     "start_time": "2021-04-25T19:21:30.834348",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "load(\"../input/wage2015-inference/wage2015_subsample_inference.Rdata\")\n",
    "Z <- subset(data,select=-c(lwage,wage)) # regressors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.010696,
     "end_time": "2021-04-25T19:21:31.078556",
     "exception": false,
     "start_time": "2021-04-25T19:21:31.067860",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Firt, we split the data first and normalize it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": 0.065685,
     "end_time": "2021-04-25T19:21:31.154228",
     "exception": false,
     "start_time": "2021-04-25T19:21:31.088543",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "set.seed(1234)\n",
    "training <- sample(nrow(data), nrow(data)*(3/4), replace=FALSE)\n",
    "\n",
    "data_train <- data[training,1:16]\n",
    "data_test <- data[-training,1:16]\n",
    "\n",
    "# data_train <- data[training,]\n",
    "# data_test <- data[-training,]\n",
    "# X_basic <-  \"sex + exp1 + exp2+ shs + hsg+ scl + clg + mw + so + we + occ2+ ind2\"\n",
    "# formula_basic <- as.formula(paste(\"lwage\", \"~\", X_basic))\n",
    "# model_X_basic_train <- model.matrix(formula_basic,data_train)[,-1]\n",
    "# model_X_basic_test <- model.matrix(formula_basic,data_test)[,-1]\n",
    "# data_train <- as.data.frame(cbind(data_train$lwage,model_X_basic_train))\n",
    "# data_test <- as.data.frame(cbind(data_test$lwage,model_X_basic_test))\n",
    "# colnames(data_train)[1]<-'lwage'\n",
    "# colnames(data_test)[1]<-'lwage'\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": 0.046693,
     "end_time": "2021-04-25T19:21:31.211056",
     "exception": false,
     "start_time": "2021-04-25T19:21:31.164363",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# normalize the data\n",
    "mean <- apply(data_train, 2, mean)\n",
    "std <- apply(data_train, 2, sd)\n",
    "data_train <- scale(data_train, center = mean, scale = std)\n",
    "data_test <- scale(data_test, center = mean, scale = std)\n",
    "data_train <- as.data.frame(data_train)\n",
    "data_test <- as.data.frame(data_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.010017,
     "end_time": "2021-04-25T19:21:31.231339",
     "exception": false,
     "start_time": "2021-04-25T19:21:31.221322",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Then, we construct the inputs for our network."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": 0.039516,
     "end_time": "2021-04-25T19:21:31.280916",
     "exception": false,
     "start_time": "2021-04-25T19:21:31.241400",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "X_basic <-  \"sex + exp1 + shs + hsg+ scl + clg + mw + so + we\"\n",
    "formula_basic <- as.formula(paste(\"lwage\", \"~\", X_basic))\n",
    "model_X_basic_train <- model.matrix(formula_basic,data_train)\n",
    "model_X_basic_test <- model.matrix(formula_basic,data_test)\n",
    "\n",
    "Y_train <- data_train$lwage\n",
    "Y_test <- data_test$lwage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.010133,
     "end_time": "2021-04-25T19:21:31.301307",
     "exception": false,
     "start_time": "2021-04-25T19:21:31.291174",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "### Neural Networks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.010107,
     "end_time": "2021-04-25T19:21:31.321796",
     "exception": false,
     "start_time": "2021-04-25T19:21:31.311689",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "First, we need to determine the structure of our network. We are using the R package *keras* to build a simple sequential neural network with three dense layers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": 1.032322,
     "end_time": "2021-04-25T19:21:32.364350",
     "exception": false,
     "start_time": "2021-04-25T19:21:31.332028",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "library(keras)\n",
    "\n",
    "build_model <- function() {\n",
    "  model <- keras_model_sequential() %>% \n",
    "    layer_dense(units = 20, activation = \"relu\", \n",
    "                input_shape = dim(model_X_basic_train)[2])%>% \n",
    "    layer_dense(units = 10, activation = \"relu\") %>% \n",
    "    layer_dense(units = 1) \n",
    "  \n",
    "  model %>% compile(\n",
    "    optimizer = optimizer_adam(lr = 0.005),\n",
    "    loss = \"mse\", \n",
    "    metrics = c(\"mae\")\n",
    "  )\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.009972,
     "end_time": "2021-04-25T19:21:32.384768",
     "exception": false,
     "start_time": "2021-04-25T19:21:32.374796",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Let us have a look at the structure of our network in detail."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": 9.011687,
     "end_time": "2021-04-25T19:21:41.406513",
     "exception": false,
     "start_time": "2021-04-25T19:21:32.394826",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "model <- build_model()\n",
    "summary(model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.010737,
     "end_time": "2021-04-25T19:21:41.429065",
     "exception": false,
     "start_time": "2021-04-25T19:21:41.418328",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "It is worth to notice that we have in total $441$ trainable parameters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.010333,
     "end_time": "2021-04-25T19:21:41.449833",
     "exception": false,
     "start_time": "2021-04-25T19:21:41.439500",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "Now, let us train the network. Note that this takes some computation time. Thus, we are using gpu to speed up. The exact speed-up varies based on a number of factors including model architecture, batch-size, input pipeline complexity, etc."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": 85.690139,
     "end_time": "2021-04-25T19:23:07.150464",
     "exception": false,
     "start_time": "2021-04-25T19:21:41.460325",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# training the network \n",
    "num_epochs <- 1000\n",
    "model %>% fit(model_X_basic_train, Y_train,\n",
    "                    epochs = num_epochs, batch_size = 100, verbose = 0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "papermill": {
     "duration": 0.010853,
     "end_time": "2021-04-25T19:23:07.177811",
     "exception": false,
     "start_time": "2021-04-25T19:23:07.166958",
     "status": "completed"
    },
    "tags": []
   },
   "source": [
    "After training the neural network, we can evaluate the performance of our model on the test sample."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": 0.223091,
     "end_time": "2021-04-25T19:23:07.411713",
     "exception": false,
     "start_time": "2021-04-25T19:23:07.188622",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# evaluating the performnace\n",
    "model %>% evaluate(model_X_basic_test, Y_test, verbose = 0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "papermill": {
     "duration": 0.266966,
     "end_time": "2021-04-25T19:23:07.690522",
     "exception": false,
     "start_time": "2021-04-25T19:23:07.423556",
     "status": "completed"
    },
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Calculating the performance measures\n",
    "pred.nn <- model %>% predict(model_X_basic_test)\n",
    "MSE.nn = summary(lm((Y_test-pred.nn)^2~1))$coef[1:2]\n",
    "R2.nn <- 1-MSE.nn[1]/var(Y_test)\n",
    "# printing R^2\n",
    "cat(\"R^2 of the neural network:\",R2.nn)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "R",
   "language": "R",
   "name": "ir"
  },
  "language_info": {
   "codemirror_mode": "r",
   "file_extension": ".r",
   "mimetype": "text/x-r-source",
   "name": "R",
   "pygments_lexer": "r",
   "version": "3.6.3"
  },
  "papermill": {
   "default_parameters": {},
   "duration": 101.688052,
   "end_time": "2021-04-25T19:23:09.794139",
   "environment_variables": {},
   "exception": null,
   "input_path": "__notebook__.ipynb",
   "output_path": "__notebook__.ipynb",
   "parameters": {},
   "start_time": "2021-04-25T19:21:28.106087",
   "version": "2.2.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
